Abstract

A large amount of numerically-oriented code has been written, and is
still being written, in legacy languages. Much of this code could, in
principle, make good use of data-parallel, throughput-oriented computer
architectures. Loo.py, a transformation-based programming system
targeted at GPUs and general data-parallel architectures, provides
a mechanism for user-controlled transformation of array programs.
This transformation capability is designed to apply not just to
programs written specifically for Loo.py, but also to those imported
from other languages such as Fortran. It eases the trade-off between
achieving high performance, portability, and programmability by
allowing the user to apply a large and growing family of
transformations to an input program. These transformations are
expressed in and used from Python and may be applied from a variety of
settings, including in a pragma-like manner from other languages.

Categories and Subject Descriptors D.3.4 [Code generators];
D.1.3 [Concurrent programming]; G.4 [Mathematical software]

Keywords Code generation, high-level language, GPU,
substitution rule, embedded language, high-performance, program
transformation, OpenCL, Fortran

1. Introduction 

Loo.py (Klöckner 2014) is a programming system for array
computations that targets CPUs, GPUs, and other, potentially
heterogeneous compute architectures. One salient feature of Loo.py is that
programs written in it necessarily consist of two parts:

• A semi-mathematical statement of the array computation to 
be carried out, in terms of a loop polyhedron and a partially 
ordered set of ‘instructions’. 

• A sequence of kernel transformations, driven by an ‘outer’
program in the high-level scripting language Python (van Rossum et al.
1994).

This strong separation is an explicit design goal, as it enables 
specialization of users, cleanliness of notation in either part, as well 
as greater flexibility in terms of transformation. 




While a prior article (Klöckner 2014) emphasized Loo.py’s
program model and semantics, this article focuses on the
transformation-related aspects of the system.

Loo.py was designed to suit a number of different use cases, all 
of which have shaped its design: 

• a means to concisely express computational kernels in the
design of scientific computing applications (such as solvers for
partial differential equations (Klöckner et al. 2009)),

• a foundation for outlining the search space to be explored by an 
autotuning component or a human performance tuner, 

• an on-the-fly code generator for computational software, 

• a code-generation back-end enabling high-level DSLs to obtain 
performance on heterogeneous architectures, and 

• a program transformation tool for de- and re-optimizing legacy 
code. 


The present article demonstrates how Loo.py can function as a 
code generation back-end for a subset of Fortran (as an example 
of a language separate from Loo.py’s own internal representation) 
while maintaining its full capability to transform the ingested code 
in a manner comprehensible and useful to the author of the original 
program. A number of mechanisms are described that are intended 
to aid the formulation of transformations on array computations in 
this setting. 

The strong separation of semantics and transformation, while
desirable, also poses a difficulty. In an annotation-based setting,
lexical proximity alone can be used to indicate what part of a
program is to be transformed. This option does not exist for Loo.py, and
so alternatives have to be devised.

The literature on code generation and optimization for
array languages is vast, and no attempt will be made to provide
a survey of the subject in any meaningful way. Instead, we will
seek to highlight a few approaches that have significantly
influenced the thinking behind Loo.py, are particularly similar,
or provide ideas for further development. Loo.py is heavily
inspired by the polyhedral model of expressing static-control
programs (Bastoul 2004; Feautrier 1996). While it takes significant
inspiration from this approach, the details of how a
program is represented, beyond the existence of a loop domain,
are quite different. High-performance compilation for GPUs,
by now, is hardly a new topic, and many different approaches
have been used, including ones using OpenMP-style directives
(Han and Abdelrahman 2011; Lee and Eigenmann 2010), ones that
are fully automatic (Yang et al. 2010), ones based on functional
languages (Svensson et al. 2010), and ones based on the polyhedral
model (Verdoolaege et al. 2013). Others define an automatic
array computation middleware (Garg and Hendren 2012) designed
as a back-end for multiple languages, including Python. Automatic,
GPU-targeted compilers for languages embedded in Python also
abound (Catanzaro et al. 2011; Continuum Analytics, Inc. 2014;
Rubinsteyn et al. 2012), most of which transform a Python AST
at run-time based on various levels of annotation and operational
abstraction.

Code generators targeting just one or a few specific workloads
(often matrix-matrix multiplication) using many of the same
techniques available in Loo.py have been presented by various
authors, ranging from early work such as PHiPAC (Bilmes et al. 1997)
to more recent OpenCL- and CUDA-based work (Cui et al. 2011;
Matsumoto et al. 2012).

Other optimizing compilers assume a substantial amount of 
domain knowledge (such as what is needed for assembly of finite 
element matrices) and leverage this to obtain parallel, optimized 
code. One example of this family of code generators is COFFEE
(Luporini et al. 2015).

Perhaps the conceptually closest prior work to the approach
taken by Loo.py is CUDA-CHiLL (Rudy et al. 2011), which
performs source-to-source translation based on a set of user-controlled
transformations (Chen et al. 2008; Hall et al. 2010). Loo.py and
CHiLL still are not quite alike, using dissimilar intermediate
representations, dissimilar levels of abstraction in the description of
transformations, and a dissimilar (static vs. program-controlled)
approach to transformation.

Source-to-source transformation has similarly been studied
extensively, with many mature systems existing in the literature (for
instance, Dave et al. 2009; Schordan and Quinlan 2003).

2. Loo.py’s view of a kernel 

We begin by briefly examining Loo.py’s model of a program (or 
‘kernel’). A very simple example of a kernel shall serve as an 
introduction. This kernel reads in one vector, doubles it, and writes 
the result to another: 

knl = loopy.make_kernel(
    "{[i]: 0<=i<n}",     # loop domain
    "out[i] = 2*a[i]")   # instructions


The above snippet of code illustrates the main components of a 
Loo.py kernel: 

• The loop domain: { [i]: 0<=i<n }. This defines the integer
values of the loop variables for which instructions (see below)
will be executed. It is written in the syntax of the isl library
(Verdoolaege 2010). Loo.py calls the loop variables inames. In
this case, i is the sole iname. n is a parameter that is passed to
the kernel by the user; in this case, n determines the length of
the vector being operated on.

To accommodate some data-dependent control flow, there is not 
actually a single loop domain, but rather a tree of loop domains, 
allowing more deeply nested domains to depend on inames 
introduced by domains closer to the root. 

• The instructions to be executed: out[i] = 2*a[i]. These are
scalar assignments between array elements, consisting of a
left-hand side assignee and a right-hand side expression. Right-hand
side expressions are allowed to contain the usual mathematical
operators, calls to externally defined functions, and references
to substitution rules (see Section 4.1).

In addition to the left-hand- and right-hand-side expressions
describing the assignment, each instruction carries the following
data:

■ An instruction identifier. A string that uniquely identifies 
each instruction. Automatically generated if not specified. 
In addition to specifying the entire unique ID, a ‘prefix’ may 
also be specified, based on which a unique ID is generated. 


■ A set of instruction tags. Used for transformation targeting
(see Section 4.2).

■ A set of inames specifying within which loops this
instruction is intended to be nested. A heuristic (Klöckner 2014) is
applied to automatically discover this information. The
iname nesting may be overridden by the user if the heuristic
does not yield the intended result.

■ A set of instruction IDs depended upon, i.e. required to
be executed before the current instruction. As described
in (Klöckner 2014), these dependencies act at the
innermost loop nesting level shared between the dependent and
depended-upon instruction. Like the nest-within inames, a
default set of dependencies is found by a heuristic that
creates dependencies on instructions that write those variables
that are read by this instruction.

■ A set of predicates, the conjunction of which determines
the condition under which the instruction will be executed.
Each predicate refers to a stored ‘flag’ variable or its
negation. This flag variable must have been set previously, and
it serves as a source for automatically generated
dependencies.
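
To make the default dependency rule concrete, the following is a minimal Python sketch. It is not Loo.py's implementation: the Instruction record and the function add_default_dependencies are hypothetical names introduced here for illustration, and loop nesting and program order are ignored. The sketch only applies the stated rule that an instruction depends on the instructions that write the variables it reads.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    """Toy stand-in for an instruction record (hypothetical, not Loo.py's class)."""
    id: str
    writes: frozenset
    reads: frozenset
    tags: frozenset = frozenset()
    depends_on: set = field(default_factory=set)


def add_default_dependencies(instructions):
    """Let each instruction depend on every other instruction that
    writes a variable it reads (simplified version of the rule above)."""
    for insn in instructions:
        for other in instructions:
            if other is not insn and other.writes & insn.reads:
                insn.depends_on.add(other.id)
    return instructions


insns = add_default_dependencies([
    Instruction("load", writes=frozenset({"a"}), reads=frozenset({"inp"})),
    Instruction("scale", writes=frozenset({"b"}), reads=frozenset({"a"})),
    Instruction("store", writes=frozenset({"out"}), reads=frozenset({"b"})),
])
deps = {insn.id: sorted(insn.depends_on) for insn in insns}
print(deps)
```

In this toy kernel, "scale" comes to depend on "load" and "store" on "scale", while "load" has no dependencies because no instruction writes inp.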

3. Transforming Fortran into Loo.py 

While Loo.py’s native intermediate representation is sufficiently
abstract and convenient that it is suited to being used directly by
a user/programmer, one main use case for Loo.py is to be a
back-end to other systems whose result is a machine representation of an
array computation.

To illustrate this use case, a Fortran (Backus et al. 1957)
front-end for Loo.py is described in the following. Along the way, this
front-end provides a convenient case study of what can be done to
enable program transformation in a setting where the structure of
an input program is not designed to be convenient for transformation
but rather given by outside constraints, such as a decades-old standards
document.

Because of a number of restrictions (see below), the main
objective of Loo.py’s Fortran front-end is not (and cannot be) to be a fully
standards-conforming Fortran compiler. Instead, it seeks to lessen
the burden of capturing kernels in Loo.py’s native
representation by providing an alternate input format with which a user
may be more familiar. Given the continuing dominance of Fortran in
scientific and engineering fields where computation is applied,
providing transformation avenues for these legacy codes is a
potentially impactful way to leverage them on modern-day
architectures.

The Fortran 77 model of computation is a surprisingly good 
match for Loo.py’s input language, sharing not just the array-based 
view of the data being operated on, but also much of its type 
system and its model of the subroutine as the main unit of program 
functionality. 

Compared to the more comprehensive Fortran 90, a number of 
restrictions exist: 

• No early exits (EXIT, CYCLE, RETURN), no mid-subroutine entry
points (ENTRY), limited data-dependent control flow (essential)

• No guaranteed order between trips through a loop (essential)

• Translation acts on a single subroutine, which will be translated
to a single OpenCL compute kernel (liftable)

• No I/O, no calls to other subroutines (liftable)

• No pointers, limited support for structured types (liftable)

• No support for SAVE and COMMON data (liftable)

• No array-level assignments and intrinsics (liftable)

• No dynamic memory management (liftable)

Each of the above restrictions is qualified with whether it is
essential and unlikely to be lifted in future revisions (essential) or
a matter of further software development (liftable). Some of these
are a direct result of underlying limitations in the OpenCL kernel
language for which Loo.py generates code.

The following example shows a Fortran kernel being translated 
by Loo.py: 


subroutine fill(out, a, n)
  implicit none
  real*8 a, out(n)
  integer n, i

  do i = 1, n
    out(i) = a
  end do
end

!$loopy begin transform
! fill = lp.split_iname(fill, "i", 128,
!     outer_tag="g.0", inner_tag="l.0")
!$loopy end transform


The code shows a straightforward vector fill kernel. Of note is mainly
the section between the $loopy begin/end transform markers (in
Fortran comments). This section consists of Python code,
and the Fortran subroutine defined above becomes available here as
a Loo.py kernel object, under a Python identifier of the same name.
At this point, the user is free to use the entire transform vocabulary
defined in Loo.py (see (Klöckner 2014) and Section 4) on their
kernel.
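
As an illustration of how such an embedded transform section can be separated from the surrounding Fortran, the following Python sketch collects the lines between the two markers and strips the leading comment character. The helper extract_transform_code is a hypothetical name introduced here; Loo.py's own front-end may process the block differently. The sketch only checks that the recovered text is syntactically valid Python and does not execute it, since that would require a Loo.py installation.

```python
BEGIN = "!$loopy begin transform"
END = "!$loopy end transform"


def extract_transform_code(fortran_source):
    """Collect the Python code between the transform markers.

    Hypothetical helper for illustration only.
    """
    lines, inside = [], False
    for line in fortran_source.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            inside = True
        elif stripped == END:
            inside = False
        elif inside and stripped.startswith("!"):
            body = stripped[1:]  # drop the Fortran comment character
            lines.append(body[1:] if body.startswith(" ") else body)
    return "\n".join(lines)


SOURCE = """
subroutine fill(out, a, n)
  implicit none
  real*8 a, out(n)
  integer n, i

  do i = 1, n
    out(i) = a
  end do
end

!$loopy begin transform
! fill = lp.split_iname(fill, "i", 128,
!     outer_tag="g.0", inner_tag="l.0")
!$loopy end transform
"""

code = extract_transform_code(SOURCE)
compile(code, "<transform>", "exec")  # raises SyntaxError if not valid Python
print(code)
```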

The availability of all of the Python programming language
for program transformation sets Loo.py apart from other
‘pragma’-type approaches to annotation such as OpenMP (Dagum and Menon
1998) or OpenACC (Group et al. 2011), as well as from other
transformation script approaches such as CHiLL (Chen et al. 2008;
Hall et al. 2010; Rudy et al. 2011). The following usage patterns
are enabled by it:

• Abstraction. Users are enabled to build their own, higher-level, 
compound transformations that may be shared among a family 
of kernels. For instance, a number of transformations changing 
the data layout of a computation could (and, likely, should) be 
shared among a group of kernels accessing said data. 

• Dynamism. Being based on a full-featured programming
language allows the transform code to respond to its environment
in interesting, non-trivial ways. As a simple example, a different 
transform path may be chosen depending on the target device 
for which code is to be generated. Alternatively, the transform 
code may consult a performance model or a database regarding 
the most promising transforms to apply. It could also be part of 
an auto-tuning scheme. 

• Introspection. Transforms (and the code calling them) are at
liberty to inspect and reason about the kernel code. For example,
it is straightforward to write a loop over a set of variables being 
written in a certain code region and apply prefetching or a data 
layout transformation to them. This helps keep the transform 
code general, adaptable, and reusable. 

These points emphasize the fact that Loo.py can be employed as a 
lower-level infrastructure component, providing enough expressive 
power for higher-level, more abstract transformations built on top 
of it. 

Loo.py takes the following steps when translating a Fortran 
kernel: 


• When a do loop is encountered, a new axis is added to the 
current loop domain. If necessary, the loop variable (‘iname’ 
in Loo.py-speak) and all its uses will be renamed to ensure 
uniqueness. 

• Fortran’s scalar assignments and data type/dimension
declarations map directly onto the corresponding features in Loo.py.

• When an if/then block is encountered, a predicate variable
is created based on the condition in the if statement, and all
Loo.py instructions created from the body of the if block have
the predicate variable (or its negation, for the else sub-block)
applied to them.

• Since Fortran programs are strongly sequentially ordered, the 
translation creates a linear chain of dependencies matching the 
program order. 


A somewhat more challenging example including conditionals is 
shown below: 


do i = 1, n
  a = inp(i)
  if (a.ge.3) then
    b = 2*a
    do j = 1,3
      b = 3 * b
    end do
    out(i) = 5*b
  else
    out(i) = 4*a
  endif
end do


Loo.py translates this to the following C/OpenCL kernel code:


for (int i = 0; i <= -1 + n; ++i)
{
  a = inp[i];
  loopy_cond0 = a >= 3;
  if (loopy_cond0)
  {
    b = 2.0 * a;
    for (int j = 0; j <= 2; ++j)
      b = 3.0 * b;
    out[i] = 5.0 * b;
  }
  if (!loopy_cond0)
    out[i] = 4.0 * a;
}


It is worth noting that instead of evaluating the conditional for each 
instruction separately, Loo.py’s code generation stage is capable of 
grouping, also across loop entries/exits, to help reduce the cost of 
conditional execution. 
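
The translation steps listed earlier in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Loo.py's front-end: the nested-tuple program representation and the functions emit and translate are hypothetical, loop bounds are omitted, and inames are not renamed for uniqueness. The sketch records, for each instruction, the enclosing inames, the accumulated predicates, and a dependency on the preceding instruction in program order.

```python
def emit(out, lhs, rhs, inames, predicates):
    """Append one instruction that depends on the previous one (linear chain)."""
    out.append({
        "id": "insn%d" % len(out),
        "lhs": lhs,
        "rhs": rhs,
        "within_inames": inames,
        "predicates": predicates,
        "depends_on": [out[-1]["id"]] if out else [],
    })


def translate(stmts, inames=(), predicates=(), out=None):
    """Flatten a nested do/if/assign structure into a list of instructions."""
    out = [] if out is None else out
    for stmt in stmts:
        if stmt[0] == "do":
            _, iname, body = stmt
            translate(body, inames + (iname,), predicates, out)
        elif stmt[0] == "if":
            _, cond, then_body, else_body = stmt
            n_flags = sum(
                1 for insn in out if insn["lhs"].startswith("loopy_cond"))
            flag = "loopy_cond%d" % n_flags
            emit(out, flag, cond, inames, predicates)  # store the flag
            translate(then_body, inames, predicates + (flag,), out)
            translate(else_body, inames, predicates + ("!" + flag,), out)
        else:
            _, lhs, rhs = stmt
            emit(out, lhs, rhs, inames, predicates)
    return out


# The conditional example from the text, written as nested tuples.
program = [
    ("do", "i", [
        ("assign", "a", "inp[i]"),
        ("if", "a >= 3",
         [("assign", "b", "2*a"),
          ("do", "j", [("assign", "b", "3*b")]),
          ("assign", "out[i]", "5*b")],
         [("assign", "out[i]", "4*a")]),
    ]),
]

insns = translate(program)
for insn in insns:
    print(insn["id"], insn["within_inames"], insn["predicates"],
          insn["lhs"], "=", insn["rhs"])
```

The six resulting instructions correspond one-to-one to the assignments in the generated C code above, including the assignment to the flag variable loopy_cond0.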

Loo.py instructions generated from segments of a Fortran
program may have tags applied to them to ease their
identification in the transformation process. This interacts with the
transformation facilities in Loo.py and allows them to be applied
to subsets of the program. This is accomplished through the
!$loopy begin/end tagged marker in a Fortran comment:

!$loopy begin tagged: input
a = cos(alpha)*inp1(i) + sin(alpha)*inp2(i)
b = -sin(alpha)*inp1(i) + cos(alpha)*inp2(i)
!$loopy end tagged: input

r = sqrt(a**2 + b**2) 
a = a/r 
b = b/r 

out1(i) = a 
out2(i) = b 


This subset of the program can then be selected for transformation
using the match expression ‘*$input’. See Section 4.2 for details.

4. Transforming Array Computations 

Loo.py employs a number of strategies to allow the creation of
maintainable, logical, and readable transformation code. One
important aspect of this is transform targeting, and a mechanism for
performing this function is discussed next.

4.1 Substitution rules 

Semantics. In addition to instructions (see Section 2), Loo.py
kernels may contain ‘substitution rules’, which, as their most basic
function, permit common subexpressions to be factored out and
defined once. In addition to simple subexpressions, substitution rules
also support parameters. The behavior of Loo.py’s substitution rule
system is similar to other macro systems, although no flow control is
provided for use during expansion. A similarity exists with the C
preprocessor, although substitution rule processing takes place at
the level of the expression tree rather than the token stream.

Unless otherwise removed, substitution rules are automatically
expanded immediately before code generation. The following
simple example illustrates their use:

lp.make_kernel(
    "{[i,j,n,n2]: 0<=i,j<npart and 0<=n,n2<3}",
    """
    grav_force(m, M, r) := -66.7^2*m*M/r**2
    <> radc = sqrt(sum(n, (x[i,n]-center[n])**2))
    <> rad_j = sqrt(sum(n2, (x[i,n2]-x[j,n2])**2))
    force[i] = grav_force(mass[i], massc, radc) + \
        sum(j, grav_force(mass[i], mass[j], rad_j))
    """)


In Loo.py’s native kernel language, substitution rules are
differentiated from assignment instructions by the use of a different
assignment operator (‘:=’) and, optionally, the use of round parentheses
on the left-hand side of the assignment to delimit argument names.
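
The difference between expansion at the level of the expression tree and expansion at the level of the token stream can be illustrated with a small sketch. It uses Python's own ast module and Python expression syntax as a stand-in for Loo.py's expression language, which is an assumption made purely for illustration; the names RuleExpander, expand, and RULES are hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast


class _Binder(ast.NodeTransformer):
    """Replace parameter names by the argument subtrees bound to them."""

    def __init__(self, bindings):
        self.bindings = bindings

    def visit_Name(self, node):
        return self.bindings.get(node.id, node)


class RuleExpander(ast.NodeTransformer):
    """Expand calls to known rules in place, working on the tree."""

    def __init__(self, rules):
        # rules: name -> (parameter names, body expression as source text)
        self.rules = rules

    def visit_Call(self, node):
        self.generic_visit(node)  # expand rule calls inside the arguments
        if isinstance(node.func, ast.Name) and node.func.id in self.rules:
            params, body_src = self.rules[node.func.id]
            # Expand nested rule invocations in the body before binding.
            body = self.visit(ast.parse(body_src, mode="eval").body)
            return _Binder(dict(zip(params, node.args))).visit(body)
        return node


def expand(expr_src, rules):
    tree = ast.parse(expr_src, mode="eval")
    return ast.unparse(RuleExpander(rules).visit(tree))


RULES = {
    "sq": (("x",), "x*x"),
    "f": (("x",), "x*a[x]"),
    "g": (("x",), "12 + f(x)"),
}

tree_level = expand("sq(a + 1)", RULES)
token_level = "x*x".replace("x", "a + 1")  # naive textual substitution
nested = expand("g(i)", RULES)
print(tree_level)
print(token_level)
print(nested)
```

The tree-level expansion keeps the argument a + 1 intact as a subtree, whereas the naive textual substitution changes the meaning of the expression through operator precedence.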

In addition to providing a convenience for coding complex 
computations, one major role of substitution rules in Loo.py is to 
provide an additional facility for attaching identifiers to parts of the 
computation. 

Creation. While Loo.py’s built-in language includes facilities
for writing substitution rules directly, it is not reasonable to expect
that every programming system to which Loo.py may be coupled
will offer this possibility; the Fortran front-end of Section 3 is one
such example. To retain the specificity contributed by substitution
rules towards the transformation targeting problem (see Section 4.2),
Loo.py provides several ways of creating substitution rules from
‘bare code’:

• Unification. Provided with a unification pattern, Loo.py can
locate all subexpressions unifiable with it and convert them to
invocations of a newly-created substitution rule. For example,
the two subexpressions involving b in the assignment

a[i] = 23*b[i]**2 + 25*b[i]**2

are unified by

knl = lp.extract_subst(knl,
    "bsquare", "alpha*b[i]**2", parameters=("alpha",))

which rewrites them to

bsquare(alpha) := alpha*b[i_0]**2
a[i] = bsquare(23) + bsquare(25)

• Wrapping of variable read access. A particular example of
unification, and in fact the most common one. Loo.py can
wrap any reading access to an array or scalar variable in a
substitution rule. Combined with precomputation (Section 4.3),
this provides a mechanism for prefetching of off-chip variables.

• Conversion of an assignment to temporary. Temporary
variables are often used to hold intermediate results for reuse.
Loo.py provides a facility to convert such an assignment into
a substitution rule. For example, the (Fortran) code



using the temporary_to_subst transformation. As one
example, this process of transitioning through a rule enables the
programmer to change the granularity of a precomputation to
comprise a larger or smaller footprint of the iteration domain. In
some sense, this undoes a common subexpression elimination
and is thus a type of de-optimization.
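
The unification-based creation of substitution rules described in the first item above can be sketched as follows. As before, Python's ast module and Python syntax stand in for Loo.py's expression language, which is an assumption for illustration only; unify, SubstExtractor, and extract_subst_sketch are hypothetical names and do not reflect how lp.extract_subst is implemented. Names listed as parameters act as wildcards, and every matching subexpression is replaced by an invocation of the new rule.

```python
import ast


def unify(pattern, node, params, bindings):
    """Match `node` against `pattern`; names in `params` are wildcards."""
    if isinstance(pattern, ast.Name) and pattern.id in params:
        if pattern.id in bindings:
            return ast.dump(bindings[pattern.id]) == ast.dump(node)
        bindings[pattern.id] = node
        return True
    if type(pattern) is not type(node):
        return False
    return all(
        _same(value, getattr(node, name), params, bindings)
        for name, value in ast.iter_fields(pattern))


def _same(p, n, params, bindings):
    """Compare two field values (subtrees, lists of subtrees, or constants)."""
    if isinstance(p, ast.AST):
        return isinstance(n, ast.AST) and unify(p, n, params, bindings)
    if isinstance(p, list):
        return (isinstance(n, list) and len(p) == len(n)
                and all(_same(x, y, params, bindings) for x, y in zip(p, n)))
    return p == n


class SubstExtractor(ast.NodeTransformer):
    def __init__(self, rule_name, pattern_src, params):
        self.rule_name = rule_name
        self.pattern = ast.parse(pattern_src, mode="eval").body
        self.params = params

    def visit(self, node):
        bindings = {}
        if isinstance(node, ast.expr) and unify(
                self.pattern, node, self.params, bindings):
            # Replace the matched subexpression by a rule invocation.
            return ast.Call(
                func=ast.Name(id=self.rule_name, ctx=ast.Load()),
                args=[bindings[p] for p in self.params],
                keywords=[])
        return self.generic_visit(node)


def extract_subst_sketch(stmt_src, rule_name, pattern_src, params):
    tree = ast.parse(stmt_src)
    extractor = SubstExtractor(rule_name, pattern_src, params)
    return ast.unparse(extractor.visit(tree))


rewritten = extract_subst_sketch(
    "a[i] = 23*b[i]**2 + 25*b[i]**2",
    "bsquare", "alpha*b[i]**2", params=("alpha",))
print(rewritten)
```

The sketch assumes that every parameter occurs in the pattern, and it produces only the rewritten assignment; the definition of the rule itself is simply the pattern.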

4.2 Transformation targeting 

In transforming computational kernels, it is often undesirable to 
apply a transformation to an entire kernel. Instead, the user may 
wish to express specifically which instructions or which
subexpressions a transform should act upon. Loo.py supports this use case by
matching names/IDs and ‘tags’ of instructions and substitution rule 
invocations. 

An example may help clarify this: 

f(x) := x*a[x]
g(x) := 12 + f(x)
h(x) := 1 + g(x) + 20*g$three(x)
a[i] = h$one(i) * h$two(i)


Three (nested) substitution rules are defined, f, g, and h. Many 
of the substitution rule invocations have a ‘tag’ applied to them 
(suffixed onto the rule identifier with a dollar sign, e.g. ‘h$two’). If 
necessary, this makes each rule invocation individually selectable. 
These tags have no influence on the meaning of the program. They 
only serve to make locations in the code identifiable. 

We apply the expand_subst transformation (which simply 
expands a substitution rule) to the invocation of g tagged three 
within the invocation of h tagged two: 

knl = expand_subst(knl, "g$three < h$two")


More generally, a user may match arbitrary portions of the rule 
expansion stack. The first component in the stack match expression 
(‘g$three’ in the above example) is necessarily the innermost 
level of expansion, and outer levels are separated by the < symbol. 
Each level consists of the ‘main identifier’, matching a substitution 
rule name or an instruction ID, and the ‘tag’, matching either an 
invocation tag on a substitution rule, or an instruction tag. Each of 
the two parts also supports shell-style wildcards. Multiple levels 
may be matched by an ellipsis (innermost < ... < outer). 

The above example results in the following code: 


f(x) := x*a[x]
g(x) := 12 + f(x)
h(x) := 1 + g(x) + 20*g$three(x)
h_0(x) := 1 + g(x) + 20*(12 + f(i))
a[i] = h$one(i)*h_0$two(i)


When expanding the specified invocation of g, not all
invocations of h (which contained the invocation of g) were affected. As
a result, a new, separate version of h, named h_0, was created, and
the relevant invocation sites of h were updated.

It should be noted that this mechanism for transformation
targeting is not limited to matching substitution rules.
Similar to substitution rules, instructions also have names and tags,
and the same notation applies. For example, a specific
instruction ID can be matched directly as ‘instruction_id’, and all
instructions whose tags match a given one may be matched by
‘*$instruction_tag’, where the wildcard * for the instruction
ID does not impose any matching constraint.
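
The stack match expressions of this section can be approximated by the following Python sketch, which uses the standard fnmatch module for the shell-style wildcards. Several points are assumptions made for illustration and are not taken from Loo.py: a stack is modeled as a list of (name, tags) pairs ordered from innermost to outermost, a level without a tag part matches any tags, the expression only needs to match the innermost portion of the stack, and an ellipsis matches zero or more levels. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_match(expr):
    """Split 'inner$tag < ... < outer' into levels; None marks an ellipsis."""
    levels = []
    for part in expr.split("<"):
        part = part.strip()
        if part == "...":
            levels.append(None)
        else:
            ident, _, tag = part.partition("$")
            levels.append((ident or "*", tag or None))
    return levels


def level_matches(level, entry):
    ident_pat, tag_pat = level
    name, tags = entry
    if not fnmatchcase(name, ident_pat):
        return False
    return tag_pat is None or any(fnmatchcase(t, tag_pat) for t in tags)


def stack_matches(levels, stack):
    """Match levels against a stack given innermost-first."""
    if not levels:
        return True  # outer levels left unconstrained (assumption)
    if levels[0] is None:
        return any(stack_matches(levels[1:], stack[k:])
                   for k in range(len(stack) + 1))
    return (bool(stack) and level_matches(levels[0], stack[0])
            and stack_matches(levels[1:], stack[1:]))


# Expansion stacks for the two invocations of h in the example above.
via_two = [("g", {"three"}), ("h", {"two"}), ("insn", set())]
via_one = [("g", {"three"}), ("h", {"one"}), ("insn", set())]

expr = parse_match("g$three < h$two")
print(stack_matches(expr, via_two), stack_matches(expr, via_one))
print(stack_matches(parse_match("g$three < ... < insn"), via_one))
print(stack_matches(parse_match("*$three"), via_one))
```

Under these assumptions, the expression "g$three < h$two" selects only the invocation of g reached through h$two, which is the behavior the expand_subst example relies on.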


4.3 Substitution rules for precomputation 

For computations that make use of the same intermediate results 
multiple times, it may be desirable to store these results in some 
form of temporary memory until they are needed again. Similarly, 
computations targeting cache-constrained architectures that
reference the same off-chip data repeatedly may want to allocate on-chip
temporary memory to avoid incurring the fetch latency for this data
again and again. This challenge is met by Loo.py’s precompute()
transformation, which generally helps programs trade off increased
on-chip storage against the cost of repeatedly fetching or
computing needed intermediate results.

To facilitate precise targeting of precomputation, Loo.py’s
precompute() transformation operates exclusively on
substitution rules. Any subexpression for which precomputation is desired
must first be converted to a substitution rule using the machinery
of Section 4.1.

Once a substitution rule has been created, the precompute
transformation can be used to allocate storage and create
instructions to store the precomputed values. This is straightforward if the
substitution rule simply represents a scalar value. More interesting
cases arise if the value of the rule or one of its invocation
arguments involves inames. In this case, a set of inames can be provided
to precompute() which, when swept out, generate all values of
the substitution rule which are to be precomputed. In this situation,
enough storage is allocated to accommodate the access footprint,
and an auxiliary set of inames is generated that sweep out the
footprint and drive the precomputation. Naturally, the precomputation
logic can be applied with the same fine-grained targeting described
in Section 4.2.


5. Some examples 

5.1 Forward differencing 

Consider this example program which computes forward
differences on a 1-dimensional array of length n:


knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "result[i] = u[i+1]-u[i]")


Since each entry of u is used twice, a plausible optimization for 
parallel architectures with limited caches (such as GPUs) is to store 
a group of values of u in storage closer to the processor. 

To achieve group-wise prefetching, we split the iteration domain 
into fixed-size pieces of length 16, assuming divisibility to ease 
understanding by avoiding the generation of many conditionals: 


knl = lp.split_iname(knl, "i", 16) 
knl = lp.assume(knl, "n mod 16 = 0") 


Next, we extract the access to u into a substitution rule u_acc and
apply (sequential, for now) precomputation for each sweep through
iterations of i_inner, assuming all other inames remain
constant:


knl = lp.extract_subst(knl, "u_acc", "u[j]",
    parameters="j")
knl = lp.precompute(knl, "u_acc", "i_inner",
    default_tag=None)


We obtain the following C code: 


float u_acc_0[17];
for (int i_outer = 0;
     i_outer <= -1 + int_floor_div_pos_b(15 + n, 16);
     ++i_outer)
{
  for (int j = 0; j <= 16; ++j)
    u_acc_0[j] = u[j + 16 * i_outer];
  for (int i_inner = 0; i_inner <= 15; ++i_inner)
    result[i_inner + i_outer * 16] =
      u_acc_0[1 + i_inner] + -1.0f * u_acc_0[i_inner];
}


Precomputation has found the (17-long) footprint of the access
to u for each group of 16 iterations through i_inner, created
a suitable prefetch loop, and modified the variable references to
match. Parallelization can then be applied to the generated loops
as described in (Klöckner 2014), as demonstrated in the example
of Section 5.2.
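
The effect of the split and the prefetch can be checked with a plain Python emulation of the generated loop structure. This is a sketch for illustration and involves no Loo.py code: the function names are hypothetical, u is assumed to hold n+1 entries, and n is assumed to be divisible by the group size, mirroring the assumption made above. The helper footprint reproduces the count of 17 entries touched by the accesses u[i] and u[i+1] during one sweep of 16 iterations.

```python
def forward_diff_direct(u, n):
    """Reference: result[i] = u[i+1] - u[i] for 0 <= i < n."""
    return [u[i + 1] - u[i] for i in range(n)]


def forward_diff_prefetched(u, n, group=16):
    """Emulates the generated code: split loop plus group-wise prefetch."""
    assert n % group == 0  # mirrors the assumption "n mod 16 = 0"
    result = [0.0] * n
    for i_outer in range(n // group):
        # Prefetch loop over the footprint of one group of iterations.
        u_acc_0 = [u[j + group * i_outer] for j in range(group + 1)]
        for i_inner in range(group):
            result[i_inner + i_outer * group] = (
                u_acc_0[1 + i_inner] - u_acc_0[i_inner])
    return result


def footprint(offsets, sweep_length):
    """Number of contiguous entries touched by u[i + off] over one sweep."""
    touched = {i + off for i in range(sweep_length) for off in offsets}
    return max(touched) - min(touched) + 1


n = 48
u = [float(k * k) for k in range(n + 1)]
same = forward_diff_prefetched(u, n) == forward_diff_direct(u, n)
size = footprint((0, 1), 16)
print(same, size)
```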


5.2 Matrix-Matrix multiplication 

This end-to-end Fortran-to-GPU example parallelizes a
matrix-matrix multiplication loop, extracts the
access to each of the argument matrices into a substitution rule, and
performs a block-wise prefetch:

subroutine dgemm(m,n,l,alpha,a,b,c)
  implicit none
  real*8 temp, a(m,l),b(l,n),c(m,n), alpha
  integer m,n,k,i,j,l

  do j = 1,n
    do k = 1,l
      do i = 1,m
        c(i,j) = c(i,j) + alpha*b(k,j)*a(i,k)
      end do
    end do
  end do
end subroutine

!$loopy begin transform
! dgemm = lp.split_iname(dgemm, "i", 16,
!     outer_tag="g.0", inner_tag="l.1")
! dgemm = lp.split_iname(dgemm, "j", 8,
!     outer_tag="g.1", inner_tag="l.0")
! dgemm = lp.split_iname(dgemm, "k", 32)
!
! dgemm = lp.extract_subst(dgemm,
!     "a_acc", "a[i1,i2]", parameters="i1, i2")
! dgemm = lp.extract_subst(dgemm,
!     "b_acc", "b[i1,i2]", parameters="i1, i2")
! dgemm = lp.precompute(dgemm,
!     "a_acc", "k_inner,i_inner")
! dgemm = lp.precompute(dgemm,
!     "b_acc", "j_inner,k_inner")
!$loopy end transform
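
To show what the transformed loop nest computes, the following Python sketch emulates the blocked structure sequentially and compares it with the original triple loop. It is an illustration under stated assumptions and not output of Loo.py: the function names are hypothetical, the matrix dimensions are assumed to be divisible by the tile sizes 16, 8, and 32, the layout of the prefetched tiles is chosen freely, and the parallel axes tagged g.0, g.1, l.0, and l.1 are simply run as ordinary loops.

```python
def dgemm_reference(m, n, l, alpha, a, b, c):
    """The original loop nest, with zero-based indices."""
    for j in range(n):
        for k in range(l):
            for i in range(m):
                c[i][j] += alpha * b[k][j] * a[i][k]


def dgemm_blocked(m, n, l, alpha, a, b, c, ti=16, tj=8, tk=32):
    """Sequential emulation of the split loops with block-wise prefetch."""
    assert m % ti == 0 and n % tj == 0 and l % tk == 0
    for i_outer in range(m // ti):
        for j_outer in range(n // tj):
            for k_outer in range(l // tk):
                # Prefetch one tile of a (ti x tk) and one of b (tk x tj).
                a_tile = [[a[i_outer * ti + ii][k_outer * tk + kk]
                           for kk in range(tk)] for ii in range(ti)]
                b_tile = [[b[k_outer * tk + kk][j_outer * tj + jj]
                           for jj in range(tj)] for kk in range(tk)]
                for ii in range(ti):
                    for jj in range(tj):
                        i, j = i_outer * ti + ii, j_outer * tj + jj
                        for kk in range(tk):
                            c[i][j] += alpha * b_tile[kk][jj] * a_tile[ii][kk]


m, n, l = 32, 16, 64
a = [[float((i + 2 * k) % 7) for k in range(l)] for i in range(m)]
b = [[float((3 * k + j) % 5) for j in range(n)] for k in range(l)]
c_ref = [[0.0] * n for _ in range(m)]
c_blk = [[0.0] * n for _ in range(m)]
dgemm_reference(m, n, l, 2.0, a, b, c_ref)
dgemm_blocked(m, n, l, 2.0, a, b, c_blk)
print(c_ref == c_blk)
```

Each tile of a and b is read from the full matrices once per block and then reused for every product within the block, which is the reuse that the prefetch into on-chip memory is meant to exploit.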


6. Conclusions 

Loo.py provides a small, composable code generation capability for
high-performance array code on CPU- and GPU-type shared
memory parallel computers. It is available under the MIT open-source
license from http://mathema.tician.de/software/loopy.
The core contributions described in this article and implemented 
in Loo.py are the following: 

• A transformation targeting scheme based on substitution rules
and ‘tags’ that can be used to specify very precisely which parts
of an expression in a program are to be transformed.

• A way of using a high-level language (Python in this instance) 
in a ‘pragma’-like role, for the transformation of a program in a 
lower-level kernel language. 

• A translation scheme from a subset of Fortran into Loo.py’s 
polyhedral-like representation. 

• The use and extraction of substitution rules to capture precisely 
what elements of a computation should be precomputed, and 
across which dependent axes. 

Loo.py’s kernel representation, its library of transformations,
and its runtime features combine to provide a compelling
environment within which array-shaped computations can be conveniently
expressed and optimized. Some examples in (Klöckner 2014)
illustrated that high-performance variants are within the set of
programs reachable via Loo.py transformations. This article described
some techniques to help broaden the set of codes that can benefit
from these transformations, providing a pathway to performance
that does not compromise maintainability and separation of
concerns.